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Q ■ Abstract 

In science and engineering, intelligent processing of complex signals such as images, sound or language 
is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically 
, inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex 

■ mappings using simple stages. Each layer performs a different operation and achieves an ever more 
sophisticated representation of the input, as, for example, in an deep artificial neural network, an ob- 
ject recognition cascade in computer vision or a speech front-end processing. Joint estimation of the 

\»«/ ■ parameters of all the layers and selection of an optimal architecture is widely considered to be a dif- 

l_J ' ficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed 

■ computation environment, and requiring significant human expert effort, which leads to suboptimal sys- 
' terns in practice. We describe a general mathematical strategy to learn the parameters and, to some 

extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This re- 
places the original problem involving a deeply nested function with a constrained problem involving a 
different function in an augmented space without nesting. The constrained problem may be solved with 
penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. 

■ MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can 
^\ ' be parallelized trivially and massively, applies even when parameter derivatives are not available or not 
I/") . desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation 

' setting, often providing reasonable models within a few iterations. 

1 Introduction 

The continued increase in recent years in data availability and processing power has enabled the development 
and practical applicability of ever more powerful models in stati s tical machine learning, for example to 
recognize faces or speech, or to translate natural language (|Bishod . l200l . However, physical limitations in 



serial computation suggest that scalable processing will require algorithms that can be massively parallelized, 
so they can profit from the thousands of inexpensive processors available in cloud computing. We focus on 
hierarchical processing architectures such as deep neural nets (fig. [IJ, which w ere originally inspired by 



biological systems such as the visual and auditory cortex in the mammalian brain ([Riesenhuber and Poggio , 
19991 : ISerre et all l2007t iGold and Morgan! . 120001 ) , and which have been proven very successful at learning 



sophisticated tasks, such as recognizing faces or speech, when trained on data. A typical neural net defines 
a hierarchical, feedforward, parametric mapping from inputs to outputs. The parameters (weights) are 
learned given a dataset by numerically minimizing an objective function. The outputs of the hidden units 
at each layer are obtained by transforming the previous layer's outputs by a linear operation with the 
layer's weights followed by a nonlinear elementwise mapping (e.g. sigmoid). Deep, nonlinear neural nets 
are universal approximators, that is, they can app roximate any target mapping (from a wide class) to 



-ppr 

arbit rary accuracy given e nough units (|Bishop| . 120061 ) . and can have more representation power than shallow 



nets ( Bengio and LeCun . 2007t) . The hidden units may encode hierarchical, distributed features that are 



useful to deal with complex sensory data. For example, when trained on images, deep nets can learn low- 
level features such as edges and T-junctions and high-level features such as parts decompositions. Other 
examples of hierarchical processing systems are cascades for object recognition and scene understanding in 
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Figure 1: Net with K = 3 hidden layers (z^: auxiliary coordinates, weights). 



computer vision ( Serre et al. L 2007t Ranzato et al. , 2007a ) or for phoneme classification in speech processing 



( Gold and Morgan . 20001 ISaon and Chienl. l2012n. wrapper approaches t o classification or regression (e.g 



based on dim ensionality reduction; Wang and Carreira-Perpifianl 2012 ). or kinematic chains in robotics 



(|Craigl . Eool . These and other architectures share a fundamental design principle: mathematically, they 



construct a deeply nested mapping from inputs to outputs. 

The ideal performance of a nested system arises when all the parameters at all layers are jointly trained 
to minimize an objective function for the desired task, such as classification error (indeed, there is evidence 
that plasticity and learning probably occurs at all stages of the ventral stream of primate visual cortex; 
Riesenhuber and Poggiol[l999t Serre et al. . 2007 ). However, this is challenging because nesting (i.e., function 
composition) prod uces inherently nonconvex functions. Jo int training is usually done with the backprop- 



agation algorithm ( Rumelhart et al. , 1986a ; Werbosl . 1974 ). which recursively computes the gradient with 



respect to each parameter using the chain rule. One can then simply update the parameters with a small 
step in the negative gradient direction as in gradient descent and stochastic gradient descent (SGD), or feed 
the gradient to a nonlinea r optimization method that will compute a better search direction, possibly using 



second-order information (jOrr and Mullerl . 1199 8V This process is repeated until a converge nce criterion is 



satisfied. Backprop in an y of these variants suffers from the problem of vanishing gradients ( Rognvaldssonl 



1994 lErhan et al 



20091) . where the gradients for lower layers are much smaller than those for higher lay- 
ers, which leads to tiny steps, slowly zigzagging down a curved valley, and a very slow convergence. This 
problem worsens with the depth of the net and led researchers in the 1990s to give up in practice with 
nets beyond around two hidden layers (with special architectures such as convo lutional nets (iLeCun et al ^ 



1998) being an exception ) until recently, when improved initialization strategics (IHinton and Salakhutdinov 



2006t iBengio et al.l . l2007h and much faster computers — but not really any improvement in the optimization 
algorithms themselves — have renewed interest in deep architectures. Besides, backprop does not parallelize 
over layers (and, with nonconvex problems, is hard to parallelize over minibatches if using SGD), is only 
applicable if the mappings are differentiable with respect to the parameters, and needs careful tuning of 
learning rates. In summary, after decades of research in neural net optimization, simple backprop-based 
algorithms such as stochastic gradient descen t remain the state-of-the-art, particular ly when combined with 
good initialization strategies ( Orr and Midler , 19981 : Hinton and Salakhutdinov , 2006). In addition, selecting 
the best architecture, for example the number of units in each layer of a deep net, or the number of filterbanks 
in a speech front-end processing, requires a combinatorial search. In practice, this is approximated with a 
manual trial-and-error procedure that is very costly in effort and expertise required, and leads to suboptimal 
solutions (where often the parameters of each layer are set irrespective of the rest of the cascade). 

We describe a general optimization strategy for deeply nested functions that we call method of auxiliary 
coordinates (MAC), which partly alleviates the vanishing gradients problem, has embarrassing parallelization, 
and can reuse existing algorithms (possibly not gradient-based) that optimize single layers or individual units. 
Section[2]describes MAC, section [3] describes related work, section 2] gives experimental results that illustrate 
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the different advantages of MAC, and the appendix gives formal theorem statements and proofs. 



2 The method of auxiliary coordinates (MAC) 

2.1 The nested objective function 

For definitcness, we describe the approach for a deep net such as that of fig. [TJ Later sections will show other 
settings. Consider a regression problem of mapping inputs x to outputs y (both high-dimensional) with a 
deep net f(x) given a dataset of N pairs (x n ,y n ). A typical objective function to learn a deep net with K 
hidden layers has the form (to simplify notation, we ignore bias parameters): 

1 N 

^i( W ) = 2Elly«- f ( x - W )H 2 f(x;W) =f A - +1 (...f 2 (f 1 (x;W 1 );W 2 )...;W /f+1 ) (1) 

n=l 

where each layer function has the form ffc(x; W&) = er(Wfcx), i.e., a linear mapping followed by a squashing 
nonlinearity (<r(t) applies a scalar function, such as the sigmoid 1/ (l + e~*), elementwise to a vector argument, 
with output in [0, 1]). Our method applies to loss functions other than squared error (e.g. cross-entropy for 
classification), with fully or sparsely connected layers each with a different number of hidden units, with 
weights shared across layers, and with regularization terms on the weights W^. The basic issue is the deep 
nesting of the mapping f . The traditional way to minimize ((T|) is by computing the gradient over all weights 
of the net using backpropagation and feeding it to a nonlinear optimization method. 

2.2 The method of auxiliary coordinates (MAC) 

We introduce one auxiliary variable per data point and per hidden unit and define the following equality- 
constrained optimization problem: 

i?(W,Z) = ^|| y „-f i , +1 (z, f ,„;W if+1 )|| 2 s.t. \. K : n K '\n = l,...,N. (2) 

2 t^i (zi : „ = fi(x„; Wi) J 

Each Zk t n can be seen as the coordinates of x„ on an intermediate feature space, or as the hidden unit acti- 
vations for x ra . Intuitively, by eliminating Z we see this is equivalent to the nested problem (JlJ ; we can prove 
under very general assumptions that both problems have exactly the same minimizers (see appendix IA.2[) . 
Problem ([2]) seems more complicated (more variables and constraints) , but each of its terms (objective and 
constraints) involve only a small subset of parameters and no nested functions. Below we show this reduces 
the ill-conditioning caused by the nesting, and partially decouples many variables, affording an efficient and 
distributed optimization. 



2.3 MAC with quadratic-penalty (QP) optimization 

The problem ([2]) may be solved with a number of constrained optimization approac hes. To illustrate the 



advan tages of MAC in the simplest way, we use the quadratic-penalty (QP) method (jNocedal and Wright 



20061 ). We optimize the following function over (W, Z) for fixed /i > and drive \i — > oo: 



N N K 

E Q (W, Z; M) = 2 E Hy» - fK + i(zK, n ;W K+1 )\\ 2 + §E E H Zfc -« - ffc(zfc-i,„; W fe )!| 2 . (3) 

= 1 n=lfe=l 



This defines a continuous path (W*(/z), Z*(/x)) which, under some mild assumptions (see proof in ap- 
pendix [B]), converges to a minimum of the constrained problem ([2]), and thus to a minimum of the original 
problem ([T|). In practice, we follow this path loosely. 

The QP objective function can be seen as breaking the functional dependences in the nested mapping f 
and unfolding it over layers. Every squared term involves only a shallow mapping; all variables (W, Z) are 
equally scaled, which improves the conditioning of the problem; and the derivatives required are simpler: we 
require no backpropagatcd gradients over W, and sometimes no gradients over at all. 

We now apply alternating optimization of the QP objective over Z and W: 
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W-step Minimizing over W for fixed Z results in a separate minimization over the weights of each hidden 
unit — each a single-layer, single-unit problem that can be solved with existing algorithms. Specifi- 
cally, for the unit (k,h), for k = 1, . . . , K + 1 (where we define zk+i,u = Yn) and h — l,...,Hk 
(assuming there are units in layer k), we have a nonlinear, least-squares regression of the form 
min w fc)1 Z)„=i i z kh,n - fkh( z k-i,n, Wfeh)) 2 , where w fc/i is the weight vector (hth row of W fc ) that feeds 
into the hth output unit of layer k, and Zkh,n the corresponding scalar target for point x„. 

Z-step Minimizing over Z for fixed W separates over the coordinates z n for each data point n = 1, . . . , N 
(omitting the subindex n and weights): 

min \ ||y - f K+ i(z K )f + | £ ||z fe - f fe ( Zfc _ a )|| 2 (4) 

fe=i 

and can be solved using the derivatives w.r.t. z of the single- layer functions fi, . . . , fA'+i- 

Thus, the W-stcp results in many independent, single-layer single-unit problems that can be solved with 
existing algorithms, without extra programmi ng cost. The Z-step is new, however it always ha s the same 
form ((4]) of a "generalized" proximal operator ( Rockafellar , 19761 Combettes and Pesauet , 2011 ) . MAC re- 



duces a complex, highly-coupled problem — training a deep net — to a sequence of simple, uncoupled problems 
(the W-step) which are coordinated through the auxiliary variables (the Z-step). For a large net with a large 
dataset, this affords an enormous potential for parallel, distributed computation. And, because each W- 
or Z-step operates over very large, decoupled blocks of variables, the decrease in the QP objective function 
is large in each iteration, unlike the tiny decreases achieved in the nested function. These large steps are 
effectively shortcuts through (W, Z)-space, instead of tiny steps along a curved valley in W-space. 

Rather than an algorithm, the method of auxiliary coordinates is a mathematical device to design opti- 
mization algorithms suited for any specific nested architecture, that are provably convergent, highly paral- 
lelizable and reuse existing algorithms for non-nested (or shallow) architectures. The key idea is the judicious 
elimination of subexpressions in a nested function via equality constraints. The architecture need not be 
strictly feedforward (e.g. recurrent nets). The designer need not introduce auxiliary coordinates at every 
layer: there is a spectrum between no auxiliary coordinates (full nesting), through hybrids that use some 
auxiliary coordinates and some semi-deep nets, to every single hidden unit having an auxiliary coordinate. 
An auxiliary coordinate may replace any subexpression of the nested function (e.g. the input to a hidden unit, 
rather than its output, leading to a quadratic W-step). Other methods for constrained optimization may be 
used (e.g. the augmented Lagrangian rather than the quadratic-penalty method). Depending on the charac- 
teristics of the problem, the W- and Z-steps may be solved with any of a number of nonlinear optimization 
methods, from gradient descent to Newton's method, and using standard techniques such as warm starts, 
caching factorizations, inexact steps, stochastic updates using data minibatches, etc. In t his respect, MAC 



is sim ilar to other "mctaalgorithms" such as expectat ion-maximization (EM) algorithms (IDempster et al 



1977 ) and alternating-direction method of multipliers ( Bovd et al. , 201lh . which have become ubiquitous in 



statistics, machine learning, optimization and other areas. 

Fig. [2] illustrates MAC learning for a sigmoidal deep autoencoder architecture, introducing auxiliary co- 
ordinates for each hidden unit at each layer (see section l4~2l for details). Classical backprop-based techniques 
such as stochastic gradient descent and conjugate gradients need many iterations to decrease the error, but 
each MAC/QP iteration achieves a large decrease, particularly at the beginning, so that it can reach a 
pretty good network pretty fast. While MAC/QP's serial performance is already remarkable, its parallel 
implementation achieves a linear speedup on the number of processors (fig.[6|). 



Stopping criterion Exactly optimizing Eq(W, Z; /x) for each /j, follows the minima path strictly but is 
unnecessary, and one usually performs an inexact, faster optimization. Unlike in a general QP problem, in 
our case we have an accurate way to know when we should exit the optimization for a given /i. Since our 
real goal is to minimize the nested error £?i(W), we exit when its value increases or decreases less than a 
set tolerance in relative terms. Further, as is common in neural net training, we use the validation error 
(i.e., Ei(W) measured on a validation set). This means we follow the path (W*(/x), Z*(/x)) not strictly but 
only inasmuch as we approach a nested minimum, and our approach can be seen as a sophisticated way of 
taking a descent step in Ei(W) but derived from Eq(W, Z; /i). Using this stopping criterion maintains our 
theoretical convergence guarantees, because the path still ends in a minimum of E\ and we drive fi — »• oo. 
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The postprocessing step Once we have finished optimizing the MAC formulation with the QP method, 
we can apply a fast post-processing step that both reduces the objective function, achieves feasibility and 
eliminates the auxiliary coordinates. We simply satisfy the constraints by setting z^n = fk{ z k-i,n] Wfc), 
k = 1, . . . , K, n = 1, . . . , N, and keep all the weights the same except for the last layer, where we set Wjfji 
by fitting fn+i to the dataset (fff(. . . (fi(X))), Y). One can prove the resulting weights reduce or leave 
unchanged the value of E\ (W) . 



2.4 Jointly learning all the parameters in heterogeneous architectures 

Another important advantage of MAC is that it is easily applicable to heterogeneous architectures, where 
each layer may perform a particular type of processing for which a specialized training algorithm exists, 
possibly not based on derivatives over the weights (so that backprop is not applicable or not convenient). 
For example, a quantization layer of an object recognition cascade, or the nonlinear layer of a radial basis 
function (RBF) network, often use a fc-means training to estimate the weights. Simply reusing this existing 
training algorithm as the W-step for that layer allows MAC to learn jointly the parameters of the entire 
network with minimal programming effort, something that is not easy or not possible with other methods. 

Fig. |3] illustrates MAC learning for an autoencoder architecture where both the encoder and the decoder 
are RBF networks, introducing auxiliary coordinates only at the coding layer (see section l4~3l for details). In 
the W-step, the basis functions of each RBF net are trained with fc-means, and the weights in the remaining 
layers are trained by least-squares. As before, MAC/QP achieves a large error decrease in a few iterations. 



2.5 Model selection 

A final advantage of MAC is that it enables an efficient search not just over the parameter values of a 
given architecture, but (to some extent) over the architectures themselves. Traditional model selection 
usually involves obtaining optimal parameters (by running an already costly numerical optimization) for each 
possible architecture, and then evaluating each architecture b ased on a criterion such as cross-validation or 
a Bayesian Information Criterion (BIC), and picking the best ( Hastie et al.l . [2009h . This discrete-continuous 



optimization involves training an exponential number of models, so in practice one settles with a suboptimal 
search (e.g. fixing by hand part of the architecture based on an expert's judgment, or selecting parts separately 
and then combining them). With MAC, model selection may be achieved "on the fly" by having the W-step 
do model selection separately for each layer, and then letting the Z-step coordinate the layers in the usual 
way. Specifically, consider a model selection criterion of the form £i(W) + C(W), where E\ is the nested 
objective function ([1]) and C(W) is additive over the layers of the net: 

C(W) = C 1 (W 1 ) + ---+C K (W K ). (5) 



This is satisfied by many criteria, such as BIC, AIC or minimum description length (jHastie et al. 1. 120091) . m 



which C(W) is essentially proportional to the number of free parameters. While optimizing Ei(W) + C(W) 
directly involves testing M K deep nets if we have M choices for each layer, with MAC the W-step separates 
over layers, and requires testing only MK single-layer nets at each iteration. While these model selection 
tests are still costly, they may be run in parallel, and they need not be run at each iteration. That is, we 
may alternate between running multiple iterations that optimize W for a given architecture, and running a 
model-selection iteration, and we still guarantee a monotonic decrease of Eq(W) + C(W). In practice, we 
observe that a near-optimal model is often found in early iterations. Thus, the ability of MAC to decouple 
optimizations reduces a search of an exponential number of complex problems to an iterated search of a 
linear number of simple problems. 

Fig. [5] illustrates how to learn the architecture with MAC for the RBF autoencoder (see section 14.41 for 
details) by trying 50 different values for the number of basis functions in each of the encoder and decoder (a 
search space of 50 2 = 2 500 architectures). Because, early during the optimization, MAC/QP settles on an 
architecture that is quite smaller than the one used in fig. [31 the result is in fact achieved in even less time. 
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3 Related work 



We believe we are the first to propose the MAC formulation in full generality for nested function learning 
as a provably equivalent, constrained problem that is to be optimized jointly in the space of parameters 
and auxiliary coordinates using quadratic-penalty, augmented Lagrangian or other methods. However, there 
exist several lines of work related to it, and MAC/QP can be seen as giving a principled setting that justifies 
previous heuristic but effective approaches, and opening the door for new, principled ways of training deep 
nets and other nested systems. 

Updating the activations of hidden units separately f r om the weights of a neural net has been done in 
the past , from early work in neural nets ( Grossman et al. . 19881: ISaad and Maroml 1990 ; Krogh et al. . 199C ; 



Rohwerl , 1990) to recent work in learning sparse features (lOlshausen and Field. 1996 . 1997 ; Ranzato et al 



2007b; Kavukcuoglu et al. . 2008) an d dimensionality reduction (jCarreira-Perpinan and Lul. l2008ll2010ll201ll 



Wang and Carreira-Perpinan , 20121) . Interest in using the activations of neural nets as independent variables 



goes back to the early days of neural nets, where learning good i nternal representations was as importan t as 
learning good weights ( Grossman et al. . 1988 ; Saad and Marom . 1990t Krogh et al. . 1990t Rohwer , 1990() . In 
fact, backpropagation was presented as a method to construct good internal representations that represent 
important features of the task domain (jRumelhart et al. . Il986bl) . This necessarily requires dealing explicitly 
with the hidden activations. Thus, while several papers proposed objective functions of both the weights and 
the activations, these were not intended to solve the nested problem or to achieve distributed optimization, 
but to help learn good representations. These algorithms typically did not converge at all, or did not 
converge to a solution of the nested problem, and were developed for a single-hidden-layer net and tested 
in very small problems . More recent variations have similar problems ( Ma et al. . 1997 ; Castillo et all 20061 ; 



Erdogmus et all 120051) . Nearly all this early work has focused on the case of a single hidden layer, which 



is easy enough to train by standard methods, so that no great advantage is obtained, and it does not 
reveal the parallel processing aspects of the problem, which become truly important in the deep net case. 
When extracting features and using overcomplete dictionaries, sparsity is often encouraged, which sometimes 
requires an explicit penalty over the features, but this has only been consid ered for a single layer ( the one that 
extracts the features) and again does not minim ize the nested problem (jOlshausen and Fieldl . 119961 Il997t 
iRanzato et al. . 2007b ; Kavukcuoglu et al. . 20081) . Some work for a single hidden layer net men tions the 
possibility of recovering backpropagation in a limit ( Krogh et al. . 1990t Kavukcuoglu et al. . 2008 ). but this 
is not used to construct an alg orithm that converges to a nested problem optimum. Recent wor ks in deep net 
learning, such as pretrai ning ( Hinton and Salakhutdinov . 20061 ) or greedy layerwise training ( Bengio et al 



2007 : Ngiam et al.l . l201lh . do a single pass over the net from the input to the output layer, fixing the weights 



of each layer sequentially, but without optimizing a joint objective of all weights. While these heuristics can 
be used to achieve good initial weights, they do not converge to a minimum of the nested problem. 

Auxiliary variables have be en used be fore in statistics and machine learning, from early work in fac- 
tor and homogeneity analysis (|Gifil . 
high-dimensional points xi, . . . ,xjy. 



199(ih . 



to learn dimensionality reduction mappings given a dataset of 
Here, one takes the latent coordinates z„ of each data point x„ as 
parameters to be estimated together with the reconstruction mapping f that maps latent points to data 
space and minimize a least-squares error function X^=i ll x « — f (z„ ) || 2 , often by alternating over f an d 
Z. Various nonlinear versions of this approach exist where f is a spline (ILeBlanc and Tibshiranil . 11994 ). 



single-laye r neural net ( Tan and Mavrovouniotis , 1995), radial basis fu nction net ( Smola et al. . 200lh . kernel 



Lawrence! . 120051 ) . However, particularly with nonpara- 



regression ( Meinicke et al. . 20051 ) or Gaussian process 
metric functions, the error can be driven to zero by separating infinitely apart the Z, and so these methods 
need ad- hoc terms on Z to prevent this . The dim ensionality reduction by unsupervised regression ap- 
proach of Carreira-Perpinan and Lul (|2008 . 2010l[2oTlh (generalized to supervised dimensionality reduction in 
IWang and Carreira-Perpinanl 120121 ) solves this by optimizing instead J2n=i ll z « — F(x n )|| 2 + j|x n — f(z„)|| 2 
jointly over Z, f and the projection mapping F (both RBF networks). This can be seen as a truncated 
version of our quadratic-penalty approach, where fi is kept constant, and limited to a single- hidden-layer 
net. Therefore, the resulting estimate for the nested mapping f(F(x)) is biased, as it does not minimize the 
nested error. 

In summary, these works were typically concerned with single-hidden-layer architectures, and did not 
solve the nested problem ([T]). Instead, their goal was to define a different problem (having a different 
solution): one where the designer has explicit control over the net's internal representations or features 







(e.g. to encourage sparsity or some other desired property). In MAC, the auxiliary coordinates are purely 
a mathematical construct to solve a well-defined, general nested optimization problem, with embarrassing 
parallelism suitable for distributed computation, and is not necessarily related to learning good hidden 
representations. Also, none of these works realize the possibility of using heterogeneous architectures with 
layer-specific algorithms, or of learning the architecture itself by minimizing a model selection criterion that 
separates in the W-step. 

Finally, the MAC formu l ation is similar in spirit to th e alternating direction method of multipliers 
(ADMM) (jBovd et al. . 2011 ; Combettes and Pesauet . 2011 ) in that variables (the auxiliary coordinates) 
are introduced that decouple terms. However, ADMM splits an existing variable that appears in multi- 
ple terms of the objective function (which then decouple) rather than a functional nesting, for example 
rnin x /(x) +p(x) becomes min x>y /(x) + g(y) s.t. x = y, or x is split into non-negative and non-positive 
parts. In contrast, MAC introduces new variables to break the nesting. ADMM is known to be very simple, 
effective and parallelizable, and to be able to achieve a pretty good estimate pretty fast, thanks to the 
decoupling introduced and the ability to use existing optimizers for the subproblems that arise. MAC also 
has these characteristics with problems involving function nesting. 



4 Experiments 

Section |4"7E1 describes how we implemented the W- and Z-steps and sections 14.21 14.31 and !4.4l show how MAC 
can learn a homogeneous architecture (deep sigmoidal autoencoder) , a heterogeneous architecture (RBF 
autoencoder) and the architecture itself, respectively. In all cases, we show the speedup achieved with a 
parallel implementation of MAC as well. 



4.1 Optimization of the MAC-constrained problem using a quadratic penalty 

We apply alternating optimization of the QP objective ([3]) over Z and W: 

W-step Minimizing over W for fixed Z results in a separate nonlinear, least-squares regression of the form 
min w fch Y, n =i ( z kh, n - fkh{zk-i,n;wkh)f for k = 1 , . . . , K + 1 (where we define z K +i, n = y„) and 
h = 1, . . . , H , where Wkh is the weight vector (hth row of W&) that feeds into the hth output unit of 
layer k (assuming there are H such units), and Zkh,n the correspon ding scalar target for point x„,. We 



solve each of these KH problems with a Gauss-Newton approach (jNocedal and Wrightl . |2006() , which 
essentially approximates the Hessian by linearizing the function f^, solves a linear system of H x H 
(the number of units feeding into z^h) to get a search direction, does a line search (we use backtracking 
with initial step size 1), and iterates. In practice 1-2 iterations converge with high tolerance. 

Z-step Minimizing over Z for fixed W separates over each z„ for n = 1,...,N. The problem is also 
a nonlinear least-squares fit, formally very similar to those of the W-step, because Z and W enter 
the objective function in a nearly symmetric way through cr(WfeZ/ £ _ 1 ), but with additional quadratic 
terms \\zk,n — . . . |j 2 . We optimize it again with the Gauss-Newton method, which usually spends 1-2 
iterations. 

This optimization of the MAC-constrained problem, based on a quadratic-penalty method with Gauss- 
Newton steps, produces reasonable results and is simple to implement, but it is not intended to be particularly 
efficient. A more efficient optimization can be achieved by (1) using other methods for constrained optimiza- 
tion, such as the augmented Lagrangian method instead of the quadratic penalty method; and (2) by using 
more efficient W- or Z-steps, by combining standard techniques (inexact steps with warm starts, caching 
factorizations, stochastic updates using data minibatches, etc.) with unconstrained optimization methods 
such as L-BFGS, conjugate gradients, gradient descent, alternating optimization or others. Exploring this 
is a topic of future research. 



Parallel implementation of MAC/QP Our parallel implementation of MAC/QP is extremely simple 
at present, yet it achieves large speedups (about 6x faster if using 12 processors), which are nearly linear 
as a function of the number of processors for all experiments, as shown in fig. |6l Given that our code is in 
Matlab, we used the Matlab Parallel Processing Toolbox. The programming effort is insignificant: all we do 
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is replace the "for" loop over weight vectors (in the W-step) or over auxiliary coordinates (in the Z-step) 
with a "parfor" loop. Matlab then sends each iteration of the loop to a different processor. We ran this in a 
shared-memory multiprocessor machine^], using up to 12 processors (a limit imposed by our Matlab license) 
and obtained the results reported in the paper. While simple, the Matlab Parallel Processing Toolbox is 
quite inefficient. Larger spcedups would be achievable with other parallel computation models such as MPI 
in C, and using a distributed architecture (so that cache and other overheads are reduced). 

4.2 Homogeneous training: deep sigmoidal autoencoder 

We use a dataset of handwritten digit images to train a deep autoencoder architecture that maps the input 
image to a low-dimensional coding layer and then tries to reconstruct the image from it. We used MAC/QP 
introducing auxiliary c oordinates for each hidden unit at each layer. Fig. [2] shows the learning curves. 

The USPS dataset (|Hulll . ll994l ). a commonly used machine learning benchmark, contains 16 x 16 grayscale 



images of handwritten digits, i.e., 256D vectors with values in [0, 1]. We use N = 5 000 images for training 
and 5 000 for validation, both randomly selected equally over all digits. 

The autoencoder architecture is 256-300-100-20-100-300-256, for a total of over 200 000 weights, with 
all K = 5 hidden layers being logistic sigmoid units and the output layer being linear. The initial weights 
are uniformly sampled from [— 1/y/fc, l /y/~fk\ for layers k = 1,...,K, respectively, where fk is the input 
dimension (fan-in) to each hidden layer ( Orr and Mullerl . 1 19981 ). When using initial random weights, a large 



(but easy) decrease in the error can be achieved simply by adjusting the biases of the output layer so the 
network output matches the mean of the target data, and all algorithms attain this in the first few iterations, 
giving the impression of a large error decrease followed by a very slow subsequent decrease. To focus the 
plots on the subsequent, more difficult error decreases, we always apply a single step of gradient descent to 
the random initial weights, which mostly adjusts the biases as described. The resulting weights are given as 
initial weights to each optimization method. 

MAC/QP runs the Z-step with 1 Gauss- Newton iteration and the W-step with up to 3 Gauss- Newton 
iterations. We also use a small regularization on W in the first few iterations, which we drop once fi > 10 4 , 
since we find that this tends to lead to a better local optimum. For each value of /i, we optimize Eq(W, Z; [a) 
in an inexact way as described in section [21 exiting when the value of the nested error i?i(W), evaluated on 
a validation set, increases or decreases less than a set tolerance in relative terms. We use a tolerance of 10~ 2 
and increase \i rather aggressively, to 10/^. 

We show the learning curves for two classical backprop-based techniques, stochastic gradient descent and 
conjugate gradients. Stochastic gradient descent (SGD) has several parameters that should be carefully set 
by the user to ensure that convergence occurs, and that convergence is as fast as possible. We did a grid search 
for the minibatch size in {1, 10, 20, 50, 100, 200, 500, 1000, 5000} and learning rate in {1, 10, 100, 1000} x 10" 7 . 
We found minibatches of 20 samples and a learning rate of 10 -6 were best. We randomly permute the train- 
ing set at the beginning of each epo ch. For nonlinear con j ugate gradients (CG), we used the Polak-Ribierc 
version, widely regarded as the best ( Nocedal and Wright . 20061 ). We use Carl Rasmussen's implementation 



minimize .m (available at http: / /learning . eng . cam. ac . uk/ carl/ code/minimize/minimize .ml, which uses 



a line search based on cubic interpolation that is more sophisticated than backtracking, and allows steps 
longer than 1. We found a minibatch of N (i.e., batch mode) worked best. We used restarts every 100 steps. 

Figure [2jleft) plots the mean squared training error for the nested objective function ((TJ) vs run time for 
the different algorithms after the initialization. The validation error follows closely the training error and 
is not plotted. Markers are shown every iteration (MAC), every 100 iterations (CG), or every 20 epochs 
(SGD, one epoch is one pass over the training set). The change points of the quadratic penalty parameter 
\i are indicated using filled red circles at the beginning of each iteration when they happened. The learning 
curve of the parallelized version of MAC (using 12 processors) is also shown in blue. Fig. [2jright) shows 
some USPS images and their reconstruction at the end of training for each algorithm. 

SGD and CG need many iterations to decrease the error, but each MAC/QP iteration achieves a large 
decrease, particularly at the beginning, so that it can reach a pretty good network pretty fast. While 
MAC/QP's serial performance is already remarkable, its parallel implementation achieves a linear speedup 
on the number of processors (fig. [B]). 



1 An Aberdeen Stirling 148 computer having 4 physical CPUs (Intel Xeon CPU L7555@ 1.87GHz), each with 8 individual 
processing cores (thus a total of 32 actual processors), and a total RAM size of 64 GB. 
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runtime (hours) 



Figure 2: Training a deep autoencoder to reconstruct images of handwritten digits. Left: nested function 
error ([1]) for each algorithm, with markers shown every iteration (MAC), every 100 iterations (CG), or 
every 20 epochs (SGD). For MAC/QP, we incremented the quadratic penalty fi as indicated at the red 
solid markers. All these experiments were run in the same computer using a single processor, except the 
parallel MAC training curve, which used 12 processors sharing the same memory using the Matlab Parallel 
Processing Toolbox. Right: reconstruction of sample training images by different methods. 



4.3 Heterogeneous training: radial basis functions autoencoder 

We use a dataset of object images to train an autoencoder architecture where both the encoder and decoder 
are RBF networks but, rather than using a gradient-based optimization for each subnet, we use fc- means 
to train the basis functions of each RBF in the W-step (while the weights in the remaining layers are 
trained by least-squares). We used MAC/QP introducing auxiliary coordinates only at the coding layer. 
Backprop-based algorithms are incompatible with fc-mcans training, so instead we compare with alternating 

optimization. Fig. [3] shows the le arning curve s. 

The COIL-20 image dataset ( Nene et al. . 19961) . commonly used as a benchmark to test dimensionality 



reduction algorithms, contains rotation sequences of 20 different objects every 5 degrees (i.e., 72 images per 
object), each a grayscale image with pixel intensity in [0, 1]. We resize the images to 32 x 32. Thus, the data 
contain 20 closed, nonlinear ID manifolds in a 1 024-dimensional space. We pick half of the images from 
objects 1 (duck) and 4 (cat) as validation set, which leaves a training set containing N = 1 368 images. 

The autoencoder architecture is as follows. The bottleneck layer of low-dimensional codes has only 2 
units, so that we can visualize the data. The encoder reduces the dimensionality of the input image to 2D, 
while the decoder reconstructs the image as best as possible from the 2D representation. Both the encoder 
and the decoder are radial basis function (RBF) networks, each having a single hidden layer. The first one 
(encoder) has the form z = f 2 (fi(x; Wi); W 2 ) = W 2 fi(x; Wi), where the vector fi(x; Wi) has Mi = 1 368 
elements (basis functions) of the form exp (— ||(x — wh)/<ti|| 2 ), i = 1, . . . , Mi, with u\ = 4, and maps an 
image x to a 2D space z. The second one (decoder) has the form x' = f4(f 3 (z; W 3 ); W 4 ) = W 4 f 3 (z; W 3 ), 
where the vector f 3 (z; W 3 ) has M 3 = 1 368 elements (basis functions) of the form exp (— || (x — w 3i )/er 3 || 2 ), 
i = 1, . . . , M 3 , with <7 3 = 0.5, and maps a 2D point z to a 1 024D image. Thus, the complete autoencoder is 
the concatenation of the two Gaussian RBF networks, it has K = 3 hidden layers with sizes 1 024-1 368-2- 
1 368-1 024, and a total of almost 3 million weights. As is usual with RBF networks, we applied a quadratic 
regularization to the linear-layer weights with a small value (A 2 = A4 = 10~ 3 ). The nested problem is then 
to minimize the following objective function, which is a least-squares error plus a quadratic regularization 
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Figure 3: Training an RBF autoencoder to reconstruct images of objects. Left: nested function error |TJ for 
each algorithm, with markers shown every iteration (MAC, alternating optimization). Other details as in 
fig- EJ Right: reconstruction of sample training images by different methods. 




Figure 4: Values of the Z coordinates (2D latent space) for the COIL-20 data set. Left: initialization obtained 
from the elastic embedding algorithm. Right: results from MAC at the end of training (section [4. 3p . 



on the linear-layer weights: 
1 N 

E ^ W ) - 2^ l|yn_f(x " ;W)l|2 + A2!|W2!l ^ + A4||W411 ^ f(x;W) =f 4 (f 3 (f 2 (f 1 (x;W 1 );W 2 );W 3 );W 4 ). 

In practice, RBF networks arc trained in two stages (|Bishop| . l2006f ). Consider the encoder, for example. First, 
one trains the centers Wi using a clustering algorithm applied to the inputs {x„}^ =1 , typically fc-means or 
(when the number of centers is large) simply by fixing them to be a random subset of the inputs. Second, 
having determined Wi, one obtains W 2 from a linear least-squares problem, by solving a linear system. 
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The reason why this is preferred to a fully nonlinear optimization over centers Wi and weights W2 is that 
it achieves near-optimal nets with a simple, noniterative procedure. This type of two-stage noniterative 
strategy to obtai n nonlinear networks i s wid ely applied be yond RBF networks, fo r example with support 
vect or ma chines (jScholkopf and Smolal . l200lh . kernel PC A (IScholkopf et all . Il998ft . slice inverse regression 



(0, Il99lh and others. 

We wish to capitalize on this attractive property to train deep autoencoders constructed by concatenating 
RBF networks. However, backprop-based algorithms are incompatible with this two-stage training procedure, 
since it does not use derivatives to optimize over the centers. This leads us to the two following optimization 
methods: an alternating optimization approach, and MAC. 

We can use an alternating optimization approach where we alternate the following two steps: (1) A 
step where we fix (Wi, W2) and train (W3, W4), by applying fc-means to W3 and a linear system to W4. 
This step is identical to the W-step in MAC over (Wi, W2). (2) A step where we fix (W3, W4) and train 
(Wi, W2), by applying fc-means to Wi and a nonlinear optimization to W2 (we use nonlinear conjugate 
gradients with 10 steps). This is because W2 no longer appears linearly in the objective function, but is 
nonlincarly embedded as the argument of the decoder. This step is significantly slower than the W-step in 
MAC over (W 3 ,W 4 ). 

We define the MAC-constrained problem as follows. We introduce auxiliary coordinates only at the coding 
layer (rather than at all K = 3 hidden layers). This allows the W-step to become the desired fc-means plus 
linear system training for the encoder and decoder separately. It requires no programming effort; we simply 
call an existing, fc-means-based RBF training algorithm for each of the encoder and decoder separately. We 
start with a quadratic penalty parameter [i = 1 and increase it to fi = 5 after 70 iterations. 

Since we use as many centers as data points (Mi = M3 = N), the fc-means step simplifies (for both 
methods) to setting each basis function center to an input point. 

In this experiment, instead of using random initial weights, we obtained initial values f or the Z coordinates 

by ru nning a nonlinear dimensionality reduction method, the elastic embedding (EE) (ICarreira-P erpinaii , 

2010 ): this gives significantly better embeddings than spectral methods; ( Tenenbaum et all 2000l : Roweis and Saul . 



20001 ) . This takes as input a matrix of N x N similarity values between every pair of COIL images xj, . . . , xjy, 
and produces a nonlinear projection z„ in 2D for each x„. We used Gaussian similarities with a kernel width 
of 10 and run EE for 200 iterations using A = 100 as its user parameter. All the optimization algorithms 
were initialized from these projections. 

Fig.[3]deft) shows the nested function error © (normalized per data point, i.e., divided by N). As before, 
MAC/QP achieves a large error decrease in a few iterations. Alternating optimization is much slower. Again, 
a parallel implementation of MAC/QP achieves a large speedup, which is nearly linear on the number of 
processors (fig. [6]). Fig. fright) shows some COIL images and their reconstruction at the end of training 
for each algorithm. Fig. |4] shows the initial projections Z (from the elastic embedding algorithm) and the 
final projections Z (after running MAC/QP). Most of the manifolds have improved, for example opening up 
loops that were folded. 

4.4 Learning the architecture: RBF autoencoder 

We repeat the experiment of the RBF autoencoder of the previous section, but now we learn its architecture. 
We jointly learn the architecture of the encoder and of the decoder by trying 50 different values for the 
number of basis functions in each (a search space of 50 2 = 2 500 architectures). We define the following 
objective function over architectures and their weights: 

E(W) =E 1 (W) + C(W) (7) 

where £7i(W) is the nested error from eq. © (including regularization terms), and C(W) = C(W 1) + • • • + 
C(W4,) is the model selection term. We use the well-known AIC criterion (jHastie et al. , l2009h . This is 
defined as 

E(&) = SSE(0) + C(0) C(0)=2e 2 |©| (8) 

(times a constant jj, which we omit) where SSE(0) is the sum of squared errors achieved in the training 
set with a model having parameters 0, e 2 is the mean squared error achieved by a low-bias model (typically 
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Figure 5: Learning the architecture of the RBF autoencoder for the dataset of fig. [3] using MAC. We show 
the total error Ei(W) + C(W) (the nested function error (TTJ) plus the model cost) per point. Model selection 
steps are run every 10 iterations and are indicated with green markers (solid if the architecture changes and 
empty if it does not change). Other details as in fig. [3] 
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number of processors 

Figure 6: Parallel processing speedup of MAC/QP as a function of the number of processors for each of the 
experiments of figures [2j |3j and [5l The speedup in all our experiments was approximately linear, reaching 
about 6x with 12 processors. 



estimated by training a model with many parameters) and |0| is the number of free parameters in the model. 
In our case, this means we use C(W) defined as 

C(W) = 2e 2 |W| = 2e 2 {DM 1 + LM\ + LM 3 + DM 3 ) = 2e 2 (D + L)(Mi + M 3 ) (9) 

where All and M3 are the numbers of centers for the encoder and decoder, respectively (first and third 
hidden layers), and D = 1024 and L = 2 are the input and output dimension of the encoder, respectively 
(equivalently, the output and input dimension of the decoder) . The total number of free parameters (centers 
and linear weights) in the autoencoder is thus |W| = {D + L){M\ + M3). 

We choose each of the numbers of centers Mi and M3 from a discrete set consisting of the 50 equispaced 
values in the range 150 to 1368 (a total of 50 2 = 2 500 different architectures). We estimated e 2 = 0.05 
from the result of the RBF autoencoder of section 14.31 which had a large number of parameters and thus 
a low bias. As in that section, the centers of each network are constrained to be equal to a subset of the 
input points (chosen at random). We set <j\ = 4, 03 = 2.5 and Ai = A2 = 10~ 3 . We start the MAC/QP 
optimization from the most complex model, having Mi = M3 = 1 368 centers (i.e., the model of the previous 
section). While every iteration optimizes the MAC/QP objective function (J3j) over (W, Z), we run a model 
selection step only every 10 iterations. This selects separately for each net the best Mk value and potentially 
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changes the size of W. Thus, every 11th iteration is a model selection step, which may or may not change 
the architecture. 

Figure [5] shows the total error E(W) = Ei(W) + C(W) of eq. (J7J (the nested function error plus the 
model cost). Model selection steps are indicated with green markers (solid if the architecture did change and 
empty if it did not change), annotated with the resulting value of {M\,Mz). Other details are as in fig. [3] 
The first change of architecture moves to a far smaller model (Mi, M3) = (700, 150), achieving an enormous 
decrease in objective. This is explained by the strong penalty that AIC imposes on the number of parameters, 
favoring simpler models. Then, this is followed by a few minor changes of architecture interleaved with a 
continuous optimization of its weights. The final architecture has (Mi,M3) = (1368,150), for a total of 
1.5 million weights. While this architecture incurs a larger training error than that of the previous section, 
it uses a much simpler model and has a lower value for the overall objective function of eq. ([7]). Because, 
early during the optimization, MAC/QP settles on an architecture that is quite smaller than the one used 
in fig. [3l the result is in fact achieved in even less time. And, again, the parallel implementation is trivial 
and achieves an approximately linear speedup on the number of processors (fig.|H])- 

5 Conclusion 

MAC drastically facilitates, in runtime and human effort, the practical design and estimation of nonconvex, 
nested problems by jointly optimizing over all parameters, reusing existing algorithms, searching automati- 
cally over architectures and affording massively parallel computation, while provably converging to a solution 
of the nested problem. It could replace or complement backpropagation-based algorithms in learning nested 
systems both in the serial and parallel settings. It is particularly timely given that serial computation is 
reaching a plateau and cloud computing is becoming a commodity, and intelligent data processing is finding- 
its way into mainstream devices (phones, cameras, etc.), thanks to increases in computational power and 
data availability. An important area of application may be the joint, automatic tuning of all stages of a 
complex, intelligent-processing system in data-rich disciplines, such as those found in computer vision and 
speech, in a distributed cloud computing environment. MAC also opens many questions, such as the optimal 
way to introduce auxiliary coordinates in a given problem, and the choice of specific algorithms to optimize 
the W- and Z-steps. 
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A Theorems and proofs 
A.l Definitions 

Consider a regression problem of mapping inputs x to outputs y (both high-dimensional) with a deep net 
f (x) given a dataset of N pairs (x„, y„). We define the nested objective function to learn a deep net with K 
hidden layers, like that in fig. [TJ as (to simplify notation, we ignore bias parameters and assume each hidden 
layer has H units): 

1 N 

^i( W ) = 2Elly«- f ( x - W )H 2 f(x;W)=f Jf +i(...f a (fi(x;Wi);W 3 )...;Wi f+ i) © 

n=l 

where each layer function has the form ffc(x; Wfc) = cr(Wfex), i.e., a linear mapping followed by a squashing 
nonlinearity (<r(t) applies a scalar function, such as the sigmoid 1/ (1 + e - *), elementwise to a vector argument, 
with output in [0, 1]). 

In the method of auxiliary coordinates (MAC), we introduce one auxiliary variable per data point and per 
hidden unit (so Z = (Zi, . . . , Zjv), with z n = (zi )Tl , . . . , z^ n )) and define the following equality-constrained 
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optimization problem: 



i?(W,Z) = i^||y„-f i , +1 (z, f ,„;W if+1 )|| 2 s.t. \.«' n KK K ^U = l,...,iV. © 

[zi,n = fi(x„;W 1 ) J 

Sometimes, for notational convenience (in particular in theorem IB.3[) . we will write the constraints for the 
nth point as a single vector constraint z n — F(z„, W;x„) = (with an obvious definition for F). We will 
also call fl the feasible set of the MAC-constrained problem, i.e., 

fi = {(W,Z): z„ = F(z„,W;x„), n=l,...,N}. (10) 



To solve the constrained problem ([2]) using the quadratic-penalty (QP) method ( Nocedal and Wrightl . 



2006), we optimize the following function over (W, Z) for fixed fi > and drive fi — > oo: 

N N K 

E Q (W, Z;/i) = - E ||y„ - ^(z^; W K+1 )f + § E E K» ~ to-i.nl W fc )|| 2 . © 

n— 1 n— 1 fc— 1 

A. 2 Equivalence of the MAC and nested formulations 

First, we give a theorem that holds under very general assumptions. In particular, it does not require the 
functions to be smooth, it holds for any loss function beyond the least-squares one, and it holds if the nested 
problem is itself subject to constraints. 

Theorem A.l. The nested problem (fT]) and the MAC-constrained problem ([2]) are equivalent in the sense 
that their minimizers are in a one-to-one correspondence. 

Proof. Let us prove that any minimizer of the nested problem is associated with a un ique minimizer of the 



MAC -constrained problem (=>), and vice versa (^). Recall the following definitions (jNocedal and Wright 



20061 ): (i) For an unconstrained minimization problem min xe n" -F^x), x * G R" is a local minimizer if there 



exists a nonempty neighborhood N C R" of x* such that F(x*) < F(x) Vx g N. (ii) For a constrained 
minimization problem minF(x) s.t. x G f! C R", x* g R" is a local minimizer if x* g fl and there exists a 
nonempty neighborhood A" C R™ of x* such that F(x*) < F(x) Vx g N H CI. 

Define the "forward-propagation" function g(W) as the result of mapping zi >n = fi(x„; Wi), . . . , Zpc, n = 
fW(z/<--i.n; Wjf) for n = 1,...,N. This maps each W to a unique Z, and satisfies {K+i(zK.n', Wa'+i) = 
f K+1 (. . . f 2 (fi(x„; Wi); W 2 ) . . . ; W K+1 ) = f (x„; W) for n = 1, . . . , N, and therefore that E^W) = E(W, g(W)) 
for any W. 

(=>) Let W* be a local minimizer of the nested problem ([1]). Then, there exists a nonempty neighborhood 
M of W* such that E X (W*) < E X (W) VW g A/". Let Z* = g(W*) and call M = {(W, Z): W g A" and Z = 
g(W)}, which is a nonempty neighborhood of (W*, Z*) in (W, Z)-space. Now, for any (W, Z) g M. DA/" we 
have that E(W,Z) = F(W,g(W)) = S X (W) > Fi(W*) = F(W*,g(W*)) = E{W*,Z*). Hence (W*,Z*) 
is a local minimizer of the MAC-constrained problem. 

(<=) Let (W*,Z*) be a local minimizer of the MAC-constrained problem ([2]). Then, there exists a 
nonempty neighborhood M of (W*,Z*) such that E(W*,Z*) < E(W,Z) V(W,Z) g M n O. Note that 
(W, Z)gA4nfi^>Z = g(W) ^> S(W, Z) = Ei(W), and this applies in particular to (W*, Z*) (which, 
being a solution, is feasible and thus belongs to M. D f2). Calling A/" = {W: (W, Z) g M. n fi}, we have that 
VW g A": £i(W) = E(W,g(W)) = E(W,Z) > E(W*,Z*) = E(W*,g(W*)) = Ei(W*). Hence W* is a 
local minimizer of the nested problem. 

Finally, one can see that the proof holds if the nested problem uses a loss function that is not the 
least-squares one, and if the nested problem is itself subject to constraints. □ 

Obviously, the theorem holds if we exchange > with > everywhere (thus exchanging non-strict with 
strict minimizers), and if we exchange "min" with "max" (hence the maximizcrs of the MAC and nested 
formulations are in a one-to-one correspondence as well). Figure [7] illustrates the theorem. Essentially, the 
nested objective function E\(W) stretches along the manifold defined by (W, Z = g(W)) preserving the 
minimizers and maximizers. The projection on W-spacc of the part of E(W, Z) that sits on top of that 
manifold recovers E\ (W) . 
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Figure 7: Illustration of the equivalence between the nested and MAC-constrained problems (see the proof of 
theorem I A. The MAC objective function E(W, Z) is shown with contour lines in the (W, Z)-space, and 
with the vertical red lines on the feasible set (W,g(W)). The nested objective function Ei(W) is shown in 
blue. Corresponding minima for both problems, W* and (W*, Z*), are indicated. 



A.3 KKT conditions 

We now show that the first-order necessary (Karush-Kuhn- Tucker, KKT) conditions of both problems (nested 
and MAC-constrained) have the same stationary points. For simplicity and clarity of exposition, we give a 
proof for the special case of K = 1. The proof for K > 1 layers follows analogously. We assume the functions 
fi and {2 have continuous first derivatives w.r.t. both its input and its weights. Jf 2 (-;W2) indicates the 
Jacobian of f 2 w.r.t. its input. To simplify notation, we sometimes omit the dependence on the weights; for 
example, we write f2(fi(x; Wi); W2) as f2(fi(x)), and Jf 2 (-; W2) as Jf 2 (-). 

Theorem A. 2. The KKT conditions for the nested problem (TTJ) and the MAC-constrained problem (J2J are 
equivalent. 

Proof. The nested problem for a nested function f2(fi(x)) is: 

1 N 

mm Ex ( Wx , W 2 ) = - V 1 1 y n - f 2 (f x (x„ ; W 1 ) ; W 2 ) 1 1 2 . 

Wi,Wj Z * — ' 

n—1 

Then we have the stationary point equation (first-order necessary conditions for a minimizcr): 




which is satisfied by all the minima, maxima and saddle points. 
The MAC-constrained problem is 

1 N 

min £(Wi,W 2 ,Z) = - V ||y„ - f 2 (z n ; W 2 )|| 2 s.t. z n = f^x*; Wi), n=l,...,N, 

Wi,W2.Z z 

n—1 
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with Lagrangian 



1 N N 

£(Wi,W 2 , Z, A) = - lly« - f 2 (z„; W 2 )j| 2 - £ A^(z n - f 1 (x„; W x )) 



and KKT conditions 



2 ^ 

ri—1 n— 1 



^-E^W».-0 (13) 



Tl=l 



iV ^ pT 



E^( f i( x «))^- f ^w) = ° (14) 



aw 2 <h <9w 2 

dz n 

z n =f l (x n ;W 1 ), »=1,...,JV. (16) 



-Jf 2 (z n ) T (y„-f 2 (z„))-A n = 0, n =l,...,iV (15) 



Substituting A„ from eq. (|T5j) and z„ from eq. (|T6|) : 

A„ = -J f2 (z„) T (y„-f 2 (z„)), n = l,...,JV CSJ) 
Z „ = f 1 (x n ;W 1 ), n=l,...,JV CU) 

into eqs. <jT2J) ~ (H3J) we recover eqs. (fTT j) - (fT2"]) . thus a KKT point of the constrained problem is a stationary 
point of the nested problem. Conversely, given a stationary point (Wi, W 2 ) of the nested problem, and 
defining A„ and z„ as in eqs. (|15[p - (|16[p . then (Wi, W 2 , Z, A) satisfies eqs. (fT3]) -([T6 |) and so is a KKT point 
of the constrained problem. Hence, there is a one-to-one correspondence between the stationary points of 
the nested problem and the KKT points of the MAC-constrained problem. □ 

From theorem IA.1I and IA.21 it follows that the minimizers, maximizers and saddle points of the nested 
problem are in one-to-one correspondence with the respective minimizers, maximizers and saddle points of 
the MAC-constrained problem. 



B Convergence of the quadratic-penalty method for MAC 

Let us first give convergence conditions for the general equality-constrained minimization problem: 

min /(x) s.t. Cj(x) = 0, i = 1, . . . , m (17) 
and the quadratic-penalty (QP) function 

Q(x ;/ ,)=/(x) + |^c|(x) (18) 

i=i 

with penalty parameter fi > 0. Given a positive increasing sequence — > oo, a nonnegative sequence 
(t^) — > 0, and a starting point xo, the QP method finds an approximate minimizer of Q(x;fik) for 
k = 1,2,..., so that the iterate x^ satisfies || V x Q(xfc; ^fc)|| < ifc. Given this algorithm, we have the 
following theorems: 



Theorem B.l (jNocedal and Wrightl . [200l Th. 17.1). Suppose that {^,k) ~^ 00 and (r^) — > 0. // each x/. is 



the exact global minimizer o/Q(x; fi^), then every limit point x* of the sequence (x&) is a global solution of 
the problem (|17[) . 

Theorem B.2 dNocedal and Wrightl . [200l Th. 17.2). Suppose that ([x k ) — > 00 and (rfc) — ¥ 0, and that x* 
is a limit point of (x&). Then x* is a stationary point of the function 21= 1 c ?( x )- Besides, if the constraint 
gradients Vci(x*), i = 1, ...,m are linearly independent, then x* is a KKT point for the problem (|17p . 
For such points, we have for any infinite subsequence JC such that limfegjc x& = x* that limfcgx; — fJ-kCi(x.k) 
X" . i = 1, . . . , m, where A* is the multiplier vector that satisfies the KKT conditions for the problem (|17|) . 
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If now we particularize these general theorems to our case, we can obtain stronger theorems. Theorem lB.il 
is generally not applicable, because optimization problems involving nested functions are typically not convex 
and have local minima. Theorem IB. 2 1 is applicable to prove convergence in the nonconvex case. We assume 
the functions fi, . . . , %k+i in eq. ([T]) have continuous first derivatives w.r.t. both its input and its weights, so 
E(W, Z) is differentiable w.r.t. W and Z. 

Theorem B.3 (Convergence of MAC/QP for nested problems). Consider the constrained problem © and 
its quadratic-penalty function Eq(W,Z; /j,) of ([3]). Given a positive increasing sequence (pk) oo, a, 
nonnegative sequence (rfc) — > 0, and a starting point (W , Z ), suppose the QP method finds an approximate 
minimizer (W fe , Z k ) of E Q (W k , Z k ; fi k ) that satisfies || V w ,z-BQ(W fe , Z fc ; fi k )\\ < r fc for k = 1, 2, . . . Then, 
lini/ c _ i . 0O (W fe , Z k ) = (W*, Z*) ; which is a KKT point for the problem @ 7 and its Lagrange multiplier vector 
has elements A* = lim^oo — fi k (Z k — F(Zj;, W fe ; x„)), n — 1, . . . , N. 

Proof. It follows by applying theorem lB.2l to the constrained problem @ and by noting that lim^oc (W k , Z k ) 
(W*, Z*) exists and that the constraint gradients are linearly independent. We prove these two statements 
in turn. 

The limit of the sequence ((W fc ,Z fe )) exists because the objective function E(W, Z) of the MAC- 
constraincd problem (hence the QP function Eq(W, Z; fi)) are lower bounded and have continuous deriva- 
tives. 

The constraint gradients are Li. at any point (W, Z) and thus, in particular, at the limit (W*,Z*). 
To see this, let us first compute the constraint gradients. There is one constraint C n fc/i(W, Z) = z nk h — 
fkh{ z n,k-i] Wfe) = for each point n = 1, . . . ,N, layer k = 1, . . . , K and unit h <= I(fc), where we define 
X(fc) as the set of auxiliary coordinate indices for layer k and z„ = x„, n = 1, . . . ,N. The gradient of this 
constraint is: 

dC n kh r- dfkh , 1 r ^ 

aw = -t>kk> aw , K = L,...,K 

dW k , dW k 

kh f <9 fkl \ 

o — = <W [ hk'Shh' - 4-i.fe'T; — I , n = 1, . . . , N, k = l,...,K, he l{k). 

OZn'k'h' \ OZ n ,k-X,hJ 

Now, we will show that these gradients are Li. at any point (W, Z). It suffices to look at the gradients w.r.t. 
Z. Call O-nkh = dfkh/dz nj k-i,h for short. Constructing a linear combination of them and setting it to zero: 



N K 
n=l k=l h£l(k) 



dC n kh 

dZ' 



This implies, for the gradient element corresponding to z n ik>h' : 

N K 

/~] /"] ^ KikhSnn' (fikk'Shh' — Sk-lM'Ctjikh) = Ki'k'h' — \i',W+l,h ( *n',k'+l,h 
n=l fc=l /iel(fe) h&X{k'+l) 



heX(k'+l) 

Applying this for k' = K, . . . , 1: 

• For fc' = iT: X n 'Kh' =0, n' = 1, . . . , N, h' € 1{K). 

• For k' = K - 1: A„/. A '-i,h' = J2hei(K) ^ri ,K,h°i-n' ,K,h = 0, n' = 1, . . . ,N, hi G Z(A' - 1). 

• ... 

• For fc' = 1: \ n '.\M = J2hei(2) An' ,2,h<*n' ,2,h = 0, n' = 1,...,N, h! € 1(1). 

Hence, all the coefficients A n fe/j are zero and the gradients arc Li. □ 
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In practice, as with any continuous optimization problem, convergence may occur in pathological cases 
to a stationary point of the constrained problem rather than a minimizer. 

In summary, MAC/QP defines a continuous path (W*(/i), Z*(/i)) which, under some mild assumptions 
(essentially, that we minimize Eq(W , Z; /i) increasingly accurately as fi — > oo), converges to a stationary 
point (typically a minimizer) of the constrained problem @, and thus to a minimizer of the nested prob- 
lem (TTJ). 

We also have the following result (for simplicity, we give it for K = 1 layer). 

Theorem B.4. // a stationary point of the QP function for the problem of theorem \A.S\ satisfies z„ = 
fi(x n ; Wi), then it is also a stationary point of the nested problem for all /i > 0. 

Proof. The QP function is: 

N N 

Eq(W u W 2 ,Z; M ) = - J2 lly« - f 2 (z„; W 2 )|| 2 + f£ ||z„ - ri (x„; W x )|| 2 . 

n— 1 n— 1 

A stationary point of Eq must satisfy the equations: 
f) F N f)f T 

9W: = -^^k {Xn)iZn - iliXn)) = (19) 

n— 1 

li: = -E^( z «)(y«-f 2 ( Z «)) = o (20) 

71=1 

Jf 2 (z„) T (y„-f 2 (z„)) + /i(z„ -fi(x„)) =0, n=l,...,N. (21) 



<9z„ 



If z n = fi(x„; Wi) for n = 1,...,N, then from eq. (f2~Tj) Jf 2 (z„) T (y„ — f 2 (z„)) = for n = 1, . . . , N and 
(Wi, W 2 ) satisfies cqs. (fTT jl -lfT!? ]) . so it is a stationary point of the nested problem for all /j > 0. □ 

Remarks: 

• Since the QP minimizer approaches the constraints from their infeasible side, the assumption z n = 
fi(x n ; Wi) for n = 1, . . . , N does not hold unless Ei(Wi, W 2 ) = there. 



• The converse of theorem IB. 41 is not generally true: if (Wi,W 2 ) is a stationary point of the nested 
problem, then defining z„ = fi (x„; Wi) for n = 1, . . . , N we have that eqs. (fT9|) - ([20|) hold but eq. (|2T|) 
does not. 

• Theorem lB .41 docs not imply that the function Eq(W, Z; /i) (for [i > fi) is an exact penalty function for 
the objective ^(W, Z), for this we need the opposite: that any local solution of the MAC-constrained 
problem is a local minimizer of Eq(W, Z; ji). 
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